html_summary <- function(model, title = "Model Summary") {
  summary_text <- capture.output(summary(model))
  
  cat(glue::glue("<h4>{title}</h4>"))
  cat('<div style="border:1px solid #ccc; padding:10px; max-height:300px; overflow:auto; font-family:monospace; background:#f9f9f9;">')
  cat("<pre>")
  cat(paste(summary_text, collapse = "\n"))
  cat("</pre>")
  cat("</div>")
}

Question 1

a). Explore with appropriate graphical tools the correlation between the variables

Step 1: Load Data Subset Variables

# Load data
load("../data/Residen.RData")
Data <- Residen

# Extract variables by category
time_vars <- colnames(Data)[1:4]    # Time variables (START YEAR, START QUARTER, etc.)
project_vars <- paste0("V", 1:8)    # Project physical/financial variables (V1-V8)
economic_lag_vars <- list(          # Define economic lag groups (Lag1-Lag5)
  Lag1 = paste0("V", 9:27),  # V9-V27
  Lag2 = paste0("V", 28:46),  # V28-V46
  Lag3 = paste0("V", 47:65),  # V47-V65
  Lag4 = paste0("V", 66:84),  # V66-V84
  Lag5 = paste0("V", 85:103))  # V85-V103     
output_vars <- c("V104", "V105")     # Output variables (sales price and construction costs)


#  Define a Reusable Function for Correlation Heatmaps

plot_cor_heatmap <- function(data, vars, title = "Correlation Heatmap", 
                             k_col = 2, k_row = 2, 
                             fontsize_row = 8, fontsize_col = 8,
                              width = 800, height = 600) {
   # Define color gradient: red-white-blue
  custom_palette <- colorRampPalette(c(
    "#FF0000",  # Red (for -1)
    "#FFFFFF",  # White (for 0)
    "#00008B"   # Dark blue (for 1.0)
))(50)
  
  # Extract subset of variables
  selected_vars <- data[, vars, drop = FALSE]
  
  # Calculate correlation matrix
  cor_matrix <- cor(selected_vars, use = "complete.obs")
  
  # Plot interactive heatmap
  heatmaply(
    cor_matrix,
    colors = custom_palette,  # Apply custom color
    limits = c(-1, 1),        # Force color range to [-1,1]
    k_col = k_col,
    k_row = k_row,
    fontsize_row = fontsize_row,
    fontsize_col = fontsize_col,
    main = title,
    show_dendrogram = c(TRUE, TRUE), # Show row/column dendrograms
    width = width,
    height = height
  )
 }

Step 2: Visualize Correlations by Category

1.Time Variables vs. Outputs

# Analyze Sales Price (V104) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V104)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sales Price by Starting Year", x = "Year", y = "Sales Price (V104)")

# Analyze Sales Price (V104) by starting quarter  
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V104)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sales Price by Starting Quarter", x = "Quarter", y = "Sales Price (V104)")

# Analyze Construction Costs (V105) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V105)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Construction Costs by Starting Year", x = "Year", y = "Construction Costs (V105)")

# Analyze Construction Costs (V105) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V105)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Construction Costs by Starting Quarter", x = "Quarter", y = "Construction Costs (V105)")

cor_time_output <- plot_cor_heatmap(
  data = Data, 
  vars = c(time_vars, output_vars),
  title = " Correlation: Time Variables vs. Outputs")

cor_time_output

Time Variables vs. Output Variables – Observations

From the heatmap and boxplots, several patterns emerge:

The heatmap shows almost a perfect positive correlation (r = 0.988) between start year and completion year This makes sense, as construction durations are relatively stable, leading to highly aligned project start and completion times.

Actual Sales Price (V104) and Construction Cost (V105) correlate strong (r = 0.796), confirming that higher costs typically drive higher sales prices.

The boxplots show an overall upward trend in both sales prices and construction costs across years, which is consistent with inflation and general increases in construction expenses over time.

The heatmap further reveals:

Start Year is moderately correlated with Actual Sales Price (r = 0.607) and strongly correlated with Actual Construction Cost (r = 0.753).

Completion year shows similar correlations: r = 0.613 with sales price and r = 0.769 with construction cost.

When examining seasonal effects, Sales Prices are fairly stable across quarters, though Quarter 1 shows a slightly higher median. In contrast, Construction Costs tend to be higher in Quarter 1 and Quarter 4, possibly due to seasonal budgeting needs or increased costs during colder months.

2.Project Variables vs. Outputs

cor_project_output <- plot_cor_heatmap(
  data = Data, 
  vars = c(project_vars, output_vars),
  title = " Correlation: Project Variables vs. Outputs")

cor_project_output

Projects Variables vs. Output Variables – Observations

Based on the correlation heatmap and the variables described, here’s some pattern I found:

The correlation between Actual Sales Price (V104) and preliminary estimated construction cost (V5) is r = 0.785, while the correlation between V104 and initial unit price (V8) is extremely high at r = 0.976. This suggests that initial unit price(V8) is an excellent predictor of final sales prices(V104). For Actual Construction Costs (V105), the correlations with preliminary cost estimates vary: r = 0.602 with V4 (total preliminary estimated cost) and r = 0.963 with V5 (preliminary estimated construction cost). This indicates that V5 is a much more reliable predictor of v105. V105 (Actual Construction Cost) also has a strong correlation with initial unit price (V8) at r = 0.790. Project locality (V1) shows negative correlations with most variables, suggesting location factors influence costs and prices inversely. The correlation between Total floor area (V2) and Lot area (V3) is r = 0.947, indicating a very strong linear relationship. This makes sense, as larger lots typically accommodate buildings with greater floor area.

The clustering in the dendrogram shows V104 and V8 ,V105 and V5, grouped closely together, confirming their strong relationship.

The data demonstrates that preliminary cost estimates (particularly V5) and initial pricing (V8) are highly reliable predictors of final project outcomes, with correlation coefficients above 0.75 in most key relationships.

  1. Economic Variables (Time Lag Analysis)
# Step 1: Save all heatmaply objects into a list
heatmap_list <- lapply(names(economic_lag_vars), function(lag_name) {
  current_vars <- c(economic_lag_vars[[lag_name]], output_vars)
  
  plot_cor_heatmap(
    data = Data,
    vars = current_vars,
    title = paste("Economic Variables:", lag_name),
    k_col = 3,
    k_row = 3
  )
})

# Step 2: Show them one by one 
heatmap_list[[1]]
heatmap_list[[2]]
heatmap_list[[3]]  
heatmap_list[[4]]  
heatmap_list[[5]]  

Economic Variables vs. Output Variables – Observations

Based on the five heatmap visualizations showing economic variables at different time lags (Lag1 through Lag5), I found some interesting patterns about correlations between the economic indicators and project outputs:

1.Interest Rate Effects (V18/V37/V56/V75/V94) consistently show strong negative correlations (deep red) with most other economic variables across all time lags. The negative correlation persists across all five time lags, suggesting a consistent long-term relationshipion persists across all five time lags, suggesting a consistent long-term relationship

2.Throughout all five time periods, variables V10 (Building services index), V11 (Wholesale price index of building materials), V13 (Cumulative liquidity), V14 (Private sector investment in new buildings), V15 (Land price index), V19 (Average construction cost at completion), V20 (Average construction cost at beginning), V23 (Consumer price index), V24 (CPI of housing, water, fuel & power), and V27 (Gold price per ounce) demonstrate extraordinarily high correlation coefficients with each other. These correlation values range from a minimum of 0.88 to a maximum of 0.999, approaching perfect correlation. Not only do these variables show strong internal correlations among themselves, but they also consistently exhibit similar positive correlations with the project output variables V104 (Actual sales prices) and V105 (Actual construction costs) across all time lags. This consistent pattern suggests these ten economic indicators move almost in unison and have a uniformly positive influence on both construction costs and sales prices throughout different time periods. The stability of these relationships across all time lags indicates they represent fundamental and reliable economic drivers in the construction and real estate markets.


2.Across all five time periods, a group of variables — like the Building Services Index (V10), Land Price Index (V15), Gold Price (V27), and several others — show extremely strong correlations with one another, with values as high as 0.999. These indicators clearly move together, almost like a single unit. What’s more, they also maintain consistent, strong positive correlations with key project outcomes like actual sales prices (V104) and construction costs (V105). This suggests that these ten variables are not only tightly linked to each other but also play a major role in influencing both pricing and costs in construction. Their steady behavior across time lags points to them being core economic drivers in the construction and real estate sectors.


2. A core group of variables — including:

  • V10: Building Services Index
  • V11: Wholesale Price Index of Building Materials
  • V13: Cumulative Liquidity
  • V14: Private Sector Investment in New Buildings
  • V15: Land Price Index
  • V19: Average Construction Cost at Completion
  • V20: Average Construction Cost at Beginning
  • V23: Consumer Price Index
  • V24: CPI of Housing, Water, Fuel & Power
  • V27: Gold Price per Ounce

These ten variables consistently show very high internal correlations, ranging from 0.88 to 0.999, suggesting they move almost in unison. They also exhibit strong and consistent positive correlations with:

  • V104: Actual Sales Prices
  • V105: Actual Construction Costs

This pattern holds across all five time lags, highlighting the stability and reliability of these relationships. This consistent pattern suggests these ten economic indicators move almost in unison and have a uniformly positive influence on both construction costs and sales prices throughout different time periods. The stability of these relationships across all time lags indicates they represent fundamental and reliable economic drivers in the construction and real estate markets.


3.The clustering in the dendrogram shows V9 (number of building permits issued) and V12 (total floor areas of building permits issued), as well as V16 (number of loans extended by banks) and V26 (population of the city), grouped closely together, confirming their strong relationships. The color between each pair is deeper blue, while being lighter with other variables, suggesting strong positive correlations within these two groups and relatively weak relationships with other variables. This pattern logically reflects that building permit counts and the total floor area of those permits are inherently linked, while the number of loans extended by banks naturally correlates with the city’s population size.

b). Please fit a linear regression model to explain the “actual sales price” (V104) in terms of the of the other variables excluding the variable “actual construction costs” (V105). Explain the output provided by the summary function.

lr_model <- lm(V104~ . - V105,data = Data)
#summary(lr_model)
html_summary(lr_model)

Model Summary

Call:
lm(formula = V104 ~ . - V105, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-901.15  -52.29   -1.50   45.71  645.31 

Coefficients: (32 not defined because of singularities)
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.004e+05  1.999e+05   0.502 0.615781    
`START YEAR`         -1.614e+03  3.388e+03  -0.476 0.634175    
`START QUARTER`      -4.928e+02  2.457e+03  -0.201 0.841139    
`COMPLETION YEAR`     1.521e+02  1.689e+01   9.007  < 2e-16 ***
`COMPLETION QUARTER`  5.984e+01  8.881e+00   6.738 8.36e-11 ***
V1                   -4.779e+00  2.225e+00  -2.148 0.032529 *  
V2                    6.630e-02  2.195e-02   3.020 0.002746 ** 
V3                   -2.331e-01  6.451e-02  -3.612 0.000356 ***
V4                    6.442e-03  3.570e-02   0.180 0.856905    
V5                   -6.548e-01  3.342e-01  -1.960 0.050975 .  
V6                    8.764e-02  6.322e-02   1.386 0.166727    
V7                           NA         NA      NA       NA    
V8                    1.203e+00  1.681e-02  71.579  < 2e-16 ***
V9                    2.016e-01  1.103e+00   0.183 0.855159    
V10                   1.373e+02  9.675e+02   0.142 0.887245    
V11                   1.048e+01  7.426e+01   0.141 0.887830    
V12                  -1.626e+02  6.837e+02  -0.238 0.812200    
V13                   1.329e-04  1.125e-01   0.001 0.999058    
V14                   1.799e-01  4.339e-01   0.415 0.678643    
V15                  -2.265e+01  5.008e+01  -0.452 0.651433    
V16                   8.696e-01  2.753e+00   0.316 0.752315    
V17                  -4.247e-03  1.860e-01  -0.023 0.981801    
V18                   2.328e+02  1.235e+03   0.188 0.850672    
V19                  -2.007e-01  4.647e+00  -0.043 0.965587    
V20                  -3.910e-01  1.187e+00  -0.329 0.742147    
V21                   6.422e-02  3.565e-01   0.180 0.857174    
V22                   3.682e-03  4.524e-01   0.008 0.993511    
V23                   6.413e+00  6.128e+02   0.010 0.991658    
V24                   4.270e+01  2.347e+02   0.182 0.855747    
V25                   5.267e-02  1.033e-01   0.510 0.610369    
V26                  -4.675e-04  4.268e-01  -0.001 0.999127    
V27                   9.455e-04  9.654e-03   0.098 0.922046    
V28                   5.915e-02  1.795e-01   0.330 0.741930    
V29                  -1.144e+02  8.626e+02  -0.133 0.894615    
V30                  -7.247e+00  1.452e+02  -0.050 0.960240    
V31                  -9.085e+01  3.036e+02  -0.299 0.764953    
V32                  -1.780e-03  8.287e-02  -0.021 0.982879    
V33                  -5.666e-02  3.707e-01  -0.153 0.878630    
V34                  -9.607e+00  1.077e+02  -0.089 0.929008    
V35                   9.323e-01  1.976e+00   0.472 0.637449    
V36                   1.490e-02  8.094e-02   0.184 0.854034    
V37                   7.421e+01  6.343e+02   0.117 0.906942    
V38                  -4.162e-01  6.246e+00  -0.067 0.946913    
V39                   6.458e-01  2.013e+00   0.321 0.748563    
V40                  -8.375e-02  1.877e-01  -0.446 0.655784    
V41                   1.615e-01  5.671e-01   0.285 0.776052    
V42                   1.890e+02  6.380e+02   0.296 0.767239    
V43                  -1.891e+02  8.982e+02  -0.211 0.833373    
V44                  -1.202e-02  5.531e-01  -0.022 0.982673    
V45                  -1.184e-02  3.554e-01  -0.033 0.973441    
V46                  -1.038e-03  8.216e-03  -0.126 0.899575    
V47                   1.040e-01  2.590e-01   0.402 0.688186    
V48                   1.119e+02  1.334e+03   0.084 0.933224    
V49                  -3.006e+01  4.980e+01  -0.604 0.546571    
V50                  -1.746e+02  4.440e+02  -0.393 0.694395    
V51                   6.526e-03  1.409e-01   0.046 0.963102    
V52                  -7.985e-02  4.933e-01  -0.162 0.871519    
V53                   6.439e+00  1.296e+02   0.050 0.960422    
V54                   2.405e+00  4.657e+00   0.516 0.605918    
V55                  -1.698e-02  1.631e-01  -0.104 0.917133    
V56                   3.089e+01  2.974e+02   0.104 0.917333    
V57                  -2.722e-01  2.233e+00  -0.122 0.903072    
V58                   3.539e-01  3.079e+00   0.115 0.908564    
V59                  -9.660e-02  5.540e-01  -0.174 0.861684    
V60                  -2.140e-01  1.307e+00  -0.164 0.870070    
V61                   4.198e+01  1.636e+02   0.257 0.797647    
V62                   2.040e+02  1.364e+03   0.150 0.881170    
V63                  -1.273e-01  8.479e-01  -0.150 0.880716    
V64                  -2.178e-02  3.772e-01  -0.058 0.953994    
V65                  -9.490e-04  8.072e-03  -0.118 0.906483    
V66                   6.260e-02  6.990e-01   0.090 0.928700    
V67                  -2.018e+02  7.758e+02  -0.260 0.794924    
V68                   6.947e+01  7.047e+01   0.986 0.324982    
V69                   1.248e+02  1.397e+02   0.894 0.372099    
V70                  -9.328e-03  4.437e-02  -0.210 0.833629    
V71                  -1.346e-01  4.248e-01  -0.317 0.751583    
V72                  -1.492e+01  2.247e+02  -0.066 0.947118    
V73                          NA         NA      NA       NA    
V74                          NA         NA      NA       NA    
V75                          NA         NA      NA       NA    
V76                          NA         NA      NA       NA    
V77                          NA         NA      NA       NA    
V78                          NA         NA      NA       NA    
V79                          NA         NA      NA       NA    
V80                          NA         NA      NA       NA    
V81                          NA         NA      NA       NA    
V82                          NA         NA      NA       NA    
V83                          NA         NA      NA       NA    
V84                          NA         NA      NA       NA    
V85                          NA         NA      NA       NA    
V86                          NA         NA      NA       NA    
V87                          NA         NA      NA       NA    
V88                          NA         NA      NA       NA    
V89                          NA         NA      NA       NA    
V90                          NA         NA      NA       NA    
V91                          NA         NA      NA       NA    
V92                          NA         NA      NA       NA    
V93                          NA         NA      NA       NA    
V94                          NA         NA      NA       NA    
V95                          NA         NA      NA       NA    
V96                          NA         NA      NA       NA    
V97                          NA         NA      NA       NA    
V98                          NA         NA      NA       NA    
V99                          NA         NA      NA       NA    
V100                         NA         NA      NA       NA    
V101                         NA         NA      NA       NA    
V102                         NA         NA      NA       NA    
V103                         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 148.6 on 296 degrees of freedom
Multiple R-squared:  0.9879,    Adjusted R-squared:  0.9848 
F-statistic: 321.9 on 75 and 296 DF,  p-value: < 2.2e-16

From the summary of the linear regression model, we can find that:

1.Residuals: Residuals are roughly symmetric around zero (median = -1.50), indicating indicating unbiased predictions on average.However, residuals range widely (-901.15 to 645.31), suggesting the model struggles with extreme values.

2.Coefficients and Significance: COMPLETION YEAR (β = 152.1, p < 0.001) and COMPLETION QUARTER (β = 59.84, p < 0.001): These two variables both have a significant positive impact on the model, indicating that the completion year and quarter are crucial predictors of the outcome. V3(β = -0.2331, p < 0.001) and V8(β = 1.203, p < 0.001):V3 has a significant negative impact, while V8 has a significant positive impact,, with V3 stands for Lot area and V8 to Price of the unit at the beginning of the project per m2. V1(β = -4.779, p < 0.05) and V2(β = 0.0663, p < 0.05):V1 has a significant negative impact, while V2 has a significant positive impact. at the 0.05 level, with V1 stands for project locality and V2 to total floor area of the buildings. V5(β = -0.6548, p < 0.1) :V5 has a marginally significant negative impact,stands for Preliminary estimated construction cost based on the prices at the beginning of the project. The linear regression model supports the findings from my previous data visualization analysis.

3.Model Performance: Adjusted R-squared: 0.9848: This tells us that the model explains about 98.48% of the changes in actual sales prices (V104). This shows the model is a good fit for the data.But High R-squared with excessive predictors (75 variables) may harm generalizability. We need careful about the overfitting risk for this model. F-statistic = 321.9 with a p-value < 2.2e-16, indicating that Model is statistically significant overall.

4.Multicollinearity and undefined coefficients: There are undefined coefficients for 32 predictor variables. This points to a multicollinearity issue. This means some variables are very closely related, making it hard to tell their separate effects. We can use VIF to check for multicollinearity and think about removing or combining variables that are highly correlated.

c). Fit a linear regression model to explain the “actual sales price” (V104) in terms of other variables (excluding the variable “actual construction costs” (V105)) using (a) backwards selection and (b) stepwise selection. Compare these two models in terms of outputs, computational time and holdout mean square error. (Hint: summarising your results in a table allows for easy comparison).

# Split the data 
set.seed(123)  # Set seed for reproducibility
train_index <- sample(1:nrow(Data), 0.8 * nrow(Data))
train_data <- Data[train_index, ]
test_data <- Data[-train_index, ]

# Define full model 
full_model <- lm(V104 ~ . -V105, data = train_data)

# (a) Backwards Selection
start_time_backward <- Sys.time()
backward_model <- stepAIC(full_model, direction = "backward", trace = 0)
end_time_backward <- Sys.time()
backward_time <- end_time_backward - start_time_backward

backward_pred <- predict(backward_model, newdata = test_data)
backward_mse <- mean((backward_pred - test_data$V104)^2)


# (b) Stepwise Selection
start_time_stepwise <- Sys.time()
stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
end_time_stepwise <- Sys.time()
stepwise_time <- end_time_stepwise - start_time_stepwise

stepwise_pred <- predict(stepwise_model, newdata = test_data)
stepwise_mse <- mean((stepwise_pred - test_data$V104)^2)


# Compare models
comparison_results <- data.frame(
  Model = c("Backward", "Stepwise"),
  Variables = c(length(coef(backward_model)), length(coef(stepwise_model))),
  R_Squared = c(summary(backward_model)$r.squared, summary(stepwise_model)$r.squared),
  Adj_R2 = c(summary(backward_model)$adj.r.squared, summary(stepwise_model)$adj.r.squared),
  Time = c(backward_time, stepwise_time),
  AIC = c(AIC(backward_model), AIC(stepwise_model)),
  BIC = c(BIC(backward_model), BIC(stepwise_model)),
  Holdout_MSE = c(backward_mse, stepwise_mse) 
  )
  
print(comparison_results)
##      Model Variables R_Squared    Adj_R2          Time      AIC      BIC
## 1 Backward        44 0.9886593 0.9867319 6.705970 secs 3793.543 3959.760
## 2 Stepwise        23 0.9886212 0.9877076 8.767907 secs 3752.539 3841.189
##   Holdout_MSE
## 1    89101.99
## 2    69494.02

Base on the table above, it can be found that:

Model Complexity: The Backward Selection model is more complex,retained 44 variables, while the Stepwise Selection model is simpler but efficient, including fewer predictors retained 23 variables, resulting in a more parsimonious model, which may improve interpretability and generalizability.

Fit Quality: Both models have almost identical R² values, showing similar explanatory power. However, Stepwise has a higher adjusted R², which accounts for model complexity

Model Selection Criteria (AIC & BIC): Both AIC and BIC are lower for the Stepwise model, suggesting it achieves better fit with fewer variables.That means it strikes a better balance between model complexity and goodness of fit.

Holdout MSE : The Stepwise model has a much lower MSE on the holdout set, suggesting better predictive performance and generalizability.

Computation Time: Backward was faster, but given the other metrics, this difference (~3.5s) is likely negligible for most practical purposes.

The Stepwise model is preferred overall: it’s simpler, generalizes better, and has better performance on most criteria (especially adjusted R², AIC, BIC, and holdout MSE), despite being a little slower to compute.

d).Fit a linear regression model to explain the “actual sales price” (V104) in terms of other variables (excluding the variable “actual construction costs” (V105)) using (i) Ridge regression (ii) LASSO regression with an appropriate 𝜆 determined by cross validation. Compare these two models in terms of outputs, computational time and cross validation mean square error.

# Use the same training/testing split as previously

# Set cross-validation parameters
set.seed(123)

# Create model matrix (excluding V105)
X_train <- model.matrix(V104 ~ . - V105, data = train_data)
y_train <- train_data$V104
X_test <- model.matrix(V104 ~ . - V105, data = test_data)
y_test <- test_data$V104

### ---- (i) Ridge Regression (alpha = 0) ---- ###
start_ridge <- Sys.time()
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
end_ridge <- Sys.time()
time_ridge <- end_ridge - start_ridge

ridge_best_lambda <- cv_ridge$lambda.min
best_ridge_mod <- glmnet(X_train, y_train, alpha = 0, lambda = ridge_best_lambda)

ridge_pred <- predict(best_ridge_mod, newx = X_test)
ridge_mse <- mean((ridge_pred - y_test)^2)

### ---- (ii) LASSO Regression (alpha = 1) ---- ###
start_lasso <- Sys.time()
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
end_lasso <- Sys.time()
time_lasso <- end_lasso - start_lasso

lasso_best_lambda <- cv_lasso$lambda.min
best_lasso_mod <- glmnet(X_train, y_train, alpha = 1, lambda = lasso_best_lambda)

lasso_pred <- predict(best_lasso_mod, newx = X_test)
lasso_mse <- mean((lasso_pred - y_test)^2)

### ---- Cross-validation plot  ---- ###
par(mfrow = c(1, 2))

plot(cv_ridge, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "Ridge: Cross-Validation", line = 2.5) 

plot(cv_lasso, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "LASSO: Cross-Validation", line = 2.5) 

par(mfrow = c(1, 1))

### --- Model Comparison Table --- ###
model_comparison <- data.frame(
  Model = c("Ridge", "LASSO","Backward", "Stepwise"),
  Time = c(time_ridge, time_lasso, backward_time, stepwise_time),
  `CV MSE` = c(min(cv_ridge$cvm), min(cv_lasso$cvm), NA, NA), 
  `Holdout MSE` = c(ridge_mse, lasso_mse, backward_mse, stepwise_mse),
  Predictors = c(length(coef(best_ridge_mod))-1, length(which(coef(best_lasso_mod)!=0))-1, 
                 length(coef(backward_model))-1, length(coef(stepwise_model))-1)
)

print(model_comparison)
##      Model           Time   CV.MSE Holdout.MSE Predictors
## 1    Ridge 0.1600900 secs 48080.05    92275.39        108
## 2    LASSO 0.0655899 secs 27436.48    61648.42         34
## 3 Backward 6.7059698 secs       NA    89101.99         43
## 4 Stepwise 8.7679069 secs       NA    69494.02         22

e). Please give a possible explanation of why one method is better than the other in this case,considering also results from backwards and stepwise selection.

From the comparison of the four models (Ridge, LASSO, Backward, and Stepwise selection), LASSO demonstrates the best overall performance in this case, and here’s why:

Lower Prediction Error:

LASSO achieves the lowest Holdout MSE (61648.42 ) compared to Ridge (92275.39), Backward (89101.99 ), and Stepwise (69494.02). This indicates LASSO generalizes better to unseen data.Its lower CV MSE (27436.48 ) (compared to Ridge’s 48080.05) further validates its robustness.

Computational Efficiency:

LASSO runs much faster (0.125 seconds) than Backward (9.823s) and Stepwise (14.149s), making it scalable for large datasets.Ridge (0.240s) is also fast but performs worse in prediction accuracy.

Variable Sparsity:

LASSO retains only 40 predictors (vs. Ridge’s 108 and Backward’s 43), reducing overfitting risk while maintaining interpretability.Ridge Model keeps all the predictors (108 here), but makes their effects smaller.Stepwise selects fewer predictors (22) but has higher Holdout MSE than LASSO, suggesting it might discard useful variables.

Why LASSO Outperforms Backward/Stepwise:

Automatic Feature Selection: LASSO’s L1 regularization shrinks irrelevant coefficients to exact zero, avoiding the “greedy” pitfalls of stepwise methods, which can get stuck in suboptimal variable subsets.

Handling Multicollinearity: LASSO and Ridge handle correlated predictors better than stepwise methods, which often fail in high-dimensional settings.

While Stepwise is simpler to interpret, its computational cost and higher error make it less practical here.Ridge retains all variables, leading to higher complexity and overfitting.LASSO balances accuracy, speed, and simplicity effectively, making it the superior choice for this dataset. Its ability to select key predictors while minimizing prediction error aligns well with the observed data patterns.

Question 2

Step 1: Data Preparation

# Load the Parkinson's dataset
parkinsons_data <- read.csv("../data/parkinsons.csv", header = TRUE)[-1]

# Function to split and scale data
prepare_data <- function(data, seed) {
  set.seed(seed)
  n <- nrow(data)
  train_indices <- sample(1:n, 30)
  
  train_data <- data[train_indices, ]
  test_data <- data[-train_indices, ]
  
  X_train <- as.matrix(train_data[, paste0("X", 1:97)])
  y_train <- train_data$UPDRS
  X_test <- as.matrix(test_data[, paste0("X", 1:97)])
  y_test <- test_data$UPDRS
  
  X_train_scaled <- scale(X_train)
  X_test_scaled <- scale(X_test, 
                         center = attr(X_train_scaled, "scaled:center"), 
                         scale = attr(X_train_scaled, "scaled:scale"))
  
  return(list(
    X_train = X_train_scaled,
    y_train = y_train,
    X_test = X_test_scaled,
    y_test = y_test
  ))
}

a). Confirm that a linear model can fit the training data exactly. Why is this model not going to be useful?

lm_data <- prepare_data(parkinsons_data, 123)

# Fit a linear model
lm_model <- lm(lm_data$y_train ~ lm_data$X_train)
#summary(lm_model)
html_summary(lm_model, title = "trial") 

trial

Call:
lm(formula = lm_data$y_train ~ lm_data$X_train)

Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!

Coefficients: (68 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            26.14        NaN     NaN      NaN
lm_data$X_trainX1    -555.71        NaN     NaN      NaN
lm_data$X_trainX2     -34.59        NaN     NaN      NaN
lm_data$X_trainX3     585.96        NaN     NaN      NaN
lm_data$X_trainX4    -153.57        NaN     NaN      NaN
lm_data$X_trainX5    -599.82        NaN     NaN      NaN
lm_data$X_trainX6     -24.87        NaN     NaN      NaN
lm_data$X_trainX7      51.83        NaN     NaN      NaN
lm_data$X_trainX8     170.76        NaN     NaN      NaN
lm_data$X_trainX9    -225.87        NaN     NaN      NaN
lm_data$X_trainX10     28.81        NaN     NaN      NaN
lm_data$X_trainX11    173.40        NaN     NaN      NaN
lm_data$X_trainX12    -86.58        NaN     NaN      NaN
lm_data$X_trainX13  31994.20        NaN     NaN      NaN
lm_data$X_trainX14   6745.51        NaN     NaN      NaN
lm_data$X_trainX15   1379.28        NaN     NaN      NaN
lm_data$X_trainX16 -32628.15        NaN     NaN      NaN
lm_data$X_trainX17  -3149.37        NaN     NaN      NaN
lm_data$X_trainX18    188.29        NaN     NaN      NaN
lm_data$X_trainX19    713.14        NaN     NaN      NaN
lm_data$X_trainX20   -239.04        NaN     NaN      NaN
lm_data$X_trainX21   -283.17        NaN     NaN      NaN
lm_data$X_trainX22    278.35        NaN     NaN      NaN
lm_data$X_trainX23    489.80        NaN     NaN      NaN
lm_data$X_trainX24   -125.69        NaN     NaN      NaN
lm_data$X_trainX25 -32067.51        NaN     NaN      NaN
lm_data$X_trainX26  -6759.54        NaN     NaN      NaN
lm_data$X_trainX27  -1427.30        NaN     NaN      NaN
lm_data$X_trainX28  32511.82        NaN     NaN      NaN
lm_data$X_trainX29   3023.98        NaN     NaN      NaN
lm_data$X_trainX30        NA         NA      NA       NA
lm_data$X_trainX31        NA         NA      NA       NA
lm_data$X_trainX32        NA         NA      NA       NA
lm_data$X_trainX33        NA         NA      NA       NA
lm_data$X_trainX34        NA         NA      NA       NA
lm_data$X_trainX35        NA         NA      NA       NA
lm_data$X_trainX36        NA         NA      NA       NA
lm_data$X_trainX37        NA         NA      NA       NA
lm_data$X_trainX38        NA         NA      NA       NA
lm_data$X_trainX39        NA         NA      NA       NA
lm_data$X_trainX40        NA         NA      NA       NA
lm_data$X_trainX41        NA         NA      NA       NA
lm_data$X_trainX42        NA         NA      NA       NA
lm_data$X_trainX43        NA         NA      NA       NA
lm_data$X_trainX44        NA         NA      NA       NA
lm_data$X_trainX45        NA         NA      NA       NA
lm_data$X_trainX46        NA         NA      NA       NA
lm_data$X_trainX47        NA         NA      NA       NA
lm_data$X_trainX48        NA         NA      NA       NA
lm_data$X_trainX49        NA         NA      NA       NA
lm_data$X_trainX50        NA         NA      NA       NA
lm_data$X_trainX51        NA         NA      NA       NA
lm_data$X_trainX52        NA         NA      NA       NA
lm_data$X_trainX53        NA         NA      NA       NA
lm_data$X_trainX54        NA         NA      NA       NA
lm_data$X_trainX55        NA         NA      NA       NA
lm_data$X_trainX56        NA         NA      NA       NA
lm_data$X_trainX57        NA         NA      NA       NA
lm_data$X_trainX58        NA         NA      NA       NA
lm_data$X_trainX59        NA         NA      NA       NA
lm_data$X_trainX60        NA         NA      NA       NA
lm_data$X_trainX61        NA         NA      NA       NA
lm_data$X_trainX62        NA         NA      NA       NA
lm_data$X_trainX63        NA         NA      NA       NA
lm_data$X_trainX64        NA         NA      NA       NA
lm_data$X_trainX65        NA         NA      NA       NA
lm_data$X_trainX66        NA         NA      NA       NA
lm_data$X_trainX67        NA         NA      NA       NA
lm_data$X_trainX68        NA         NA      NA       NA
lm_data$X_trainX69        NA         NA      NA       NA
lm_data$X_trainX70        NA         NA      NA       NA
lm_data$X_trainX71        NA         NA      NA       NA
lm_data$X_trainX72        NA         NA      NA       NA
lm_data$X_trainX73        NA         NA      NA       NA
lm_data$X_trainX74        NA         NA      NA       NA
lm_data$X_trainX75        NA         NA      NA       NA
lm_data$X_trainX76        NA         NA      NA       NA
lm_data$X_trainX77        NA         NA      NA       NA
lm_data$X_trainX78        NA         NA      NA       NA
lm_data$X_trainX79        NA         NA      NA       NA
lm_data$X_trainX80        NA         NA      NA       NA
lm_data$X_trainX81        NA         NA      NA       NA
lm_data$X_trainX82        NA         NA      NA       NA
lm_data$X_trainX83        NA         NA      NA       NA
lm_data$X_trainX84        NA         NA      NA       NA
lm_data$X_trainX85        NA         NA      NA       NA
lm_data$X_trainX86        NA         NA      NA       NA
lm_data$X_trainX87        NA         NA      NA       NA
lm_data$X_trainX88        NA         NA      NA       NA
lm_data$X_trainX89        NA         NA      NA       NA
lm_data$X_trainX90        NA         NA      NA       NA
lm_data$X_trainX91        NA         NA      NA       NA
lm_data$X_trainX92        NA         NA      NA       NA
lm_data$X_trainX93        NA         NA      NA       NA
lm_data$X_trainX94        NA         NA      NA       NA
lm_data$X_trainX95        NA         NA      NA       NA
lm_data$X_trainX96        NA         NA      NA       NA
lm_data$X_trainX97        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 29 and 0 DF,  p-value: NA

The R-squared value of 1 indicates a perfect fit on the training data. However, this is likely due to overfitting, as the model is using 97 predictors with only 30 observations. In such cases, the model memorizes the data instead of learning meaningful relationships, which results in poor generalization to new data.

b). Now use the LASSO to fit the training data, using leave-one-out cross-validation to find the tuning parameter 𝜆. (This means 𝑛𝑓𝑜𝑙𝑑𝑠 = 30. You will also find it convenient to set 𝑔𝑟𝑖𝑑 = 10^𝑠𝑒𝑞(3,−1,100) and 𝑡ℎ𝑟𝑒𝑠ℎ = 1𝑒 −10). What is the optimal value of 𝜆? and what is the resulting test error?

c). State your final model for the UPDRS. How many features have been selected? What conclusions can you draw?

# General LASSO function to fit and return key results
lasso_regression <- function(X_train, y_train, X_test, y_test) {
  # Define lambda grid for tuning
  grid <- 10^seq(3, -1, length.out = 100)

  # Perform Leave-One-Out Cross-Validation
  lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = grid,
                        nfolds = nrow(X_train), thresh = 1e-10)

  # Extract the best lambda value
  optimal_lambda <- lasso_cv$lambda.min

  # Fit final LASSO model using the best lambda
  lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = optimal_lambda)

  # Predict on the test set and compute mean squared error
  y_pred <- predict(lasso_model, newx = X_test)
  test_error <- mean((y_test - y_pred)^2)

  # Get the names of selected (non-zero) features
  coef_full <- coef(lasso_model)
  selected_features <- rownames(coef_full)[coef_full[, 1] != 0][-1]

  # Build the final model formula as a string
  formula_terms <- sprintf("%.4f * %s", coef_full[selected_features, 1], selected_features)
  formula_str <- paste("UPDRS =", round(coef_full[1,1], 4), "+", paste(formula_terms, collapse = " + "))

  # Return results
  return(list(
    lambda = optimal_lambda,
    test_error = test_error,
    selected_features = selected_features,
    n_nonzero = length(selected_features),
    formula = formula_str,
    cv_plot = lasso_cv
  ))
}


# Prepare data and fit the model
data123 <- prepare_data(parkinsons_data, 123)
result123 <- lasso_regression(data123$X_train, data123$y_train, data123$X_test, data123$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation errors vs. log(lambda)
plot(result123$cv_plot)

# Print key results
cat("Best lambda:", result123$lambda, "\n")
## Best lambda: 1.023531
cat("Test set MSE:", result123$test_error, "\n")
## Test set MSE: 20.41405
cat("Number of selected features:", result123$n_nonzero, "\n")
## Number of selected features: 4
cat("Selected features:", paste(result123$selected_features, collapse = ", "), "\n")
## Selected features: X9, X82, X83, X97
cat("LASSO model equation:\n", result123$formula, "\n")
## LASSO model equation:
##  UPDRS = 26.1423 + 0.2039 * X9 + 0.0586 * X82 + 0.3937 * X83 + 7.0572 * X97

The LASSO regression identified a lambda value (≈1.02) that minimized cross-validated MSE, resulting in a model with only four non-zero coefficients: X9, X82, X83, and X97. This demonstrates strong feature selection capability, improving interpretability and reducing overfitting risks. The resulting formula list below: \[ \text{UPDRS} = 26.1423 + 0.2039 \times X_9 + 0.0586 \times X_{82} + 0.3937 \times X_{83} + 7.0572 \times X_{97} \] With LASSO, the final model includes only four key predictors, which significantly improves interpretability. The coefficients provide direct insight into how each selected feature contributes to the UPDRS score. This sparse model performs well (MSE ≈ 20.4), showing a good balance between accuracy and simplicity.

d). Repeat your analysis with a different random split of the training and test sets. Have the same features been selected in your final model?

# Prepare data with a different seed and fit the model
data321 <- prepare_data(parkinsons_data, 321)
result321 <- lasso_regression(data321$X_train, data321$y_train, data321$X_test, data321$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation result
plot(result321$cv_plot)

# Print key results
cat("Best lambda:", result321$lambda, "\n")
## Best lambda: 0.9326033
cat("Test set MSE:", result321$test_error, "\n")
## Test set MSE: 4.899441
cat("Number of selected features:", result321$n_nonzero, "\n")
## Number of selected features: 3
cat("Selected features:", paste(result321$selected_features, collapse = ", "), "\n")
## Selected features: X83, X95, X97
cat("LASSO model equation:\n", result321$formula, "\n")
## LASSO model equation:
##  UPDRS = 25.4929 + 0.2028 * X83 + 0.2533 * X95 + 9.0117 * X97
# Build a comparison table of the two models
comparison_table <- data.frame(
  Seed = c(123, 321),
  Optimal_Lambda = c(result123$lambda, result321$lambda),
  Test_MSE = c(result123$test_error, result321$test_error),
  Num_Selected_Features = c(result123$n_nonzero, result321$n_nonzero),
  Selected_Features = c(
    paste(result123$selected_features, collapse = ", "),
    paste(result321$selected_features, collapse = ", ")
  )
)

# Display the table 
print(comparison_table)
##   Seed Optimal_Lambda  Test_MSE Num_Selected_Features Selected_Features
## 1  123      1.0235310 20.414054                     4 X9, X82, X83, X97
## 2  321      0.9326033  4.899441                     3     X83, X95, X97

When We repeated the analysis with a different random seed (from 123 to 321), which changed the training-test split. The test set Mean Squared Error (MSE) for the previous dataset was 20.414054, while for the new dataset, it decreased to 4.899441. This reduction indicates that the model trained with the new dataset performs better on the test set.

The final model selected a slightly different set of features. Under seed = 123, the selected features were: X9, X82, X83, X97 Under seed = 321, the selected features were: X83, X95, X97

While some features like X83 and X97 appeared in both models, others (e.g., X9, X82 and X95) were different. This indicates that the feature selection process is sensitive to the specific training-test split and that small changes in data partitioning can lead to different optimal subsets.

The final model formula list below: \[ \text{UPDRS} = 25.4929 + 0.2028 \times X_{83} + 0.2533 \times X_{95} + 9.0117 \times X_{97} \]

Question 3

a). Train an ElasticNet model to predict MEAN_ANNUAL_RAINFALL using 10-fold cross-validation to optimize values for 𝛼 and 𝜆.

  1. Data Preparation and Splitting
# Load the data
weather_data <- read.csv("../data/Weather_Station_Data_v1.csv")

# Set seed for reproducibility
set.seed(321)

# Split the data into training and test sets
train_indices <- sample(1:nrow(weather_data), size = 0.8 * nrow(weather_data))
train_data <- weather_data[train_indices, ]
test_data <- weather_data[-train_indices, ]

# Define predictors and response
X_train <- as.matrix(train_data[, -which(names(train_data) == "MEAN_ANNUAL_RAINFALL")])
y_train <- train_data$MEAN_ANNUAL_RAINFALL
X_test <- as.matrix(test_data[, -which(names(test_data) == "MEAN_ANNUAL_RAINFALL")])
y_test <- test_data$MEAN_ANNUAL_RAINFALL
  1. ElasticNet Model with Cross-Validation
# Define a sequence of alpha values
alphas <- seq(0, 1, by = 0.1)

# Perform cross-validation for each alpha
cv_results <- lapply(alphas, function(a) {
  cv.glmnet(X_train, y_train, alpha = a, nfolds = 10)
})

# Find the best model based on minimum cross-validated error
best_alpha_idx <- which.min(sapply(cv_results, function(cv) min(cv$cvm)))
best_cv_mod <- cv_results[[best_alpha_idx]]
best_alpha <- alphas[best_alpha_idx]

cat("Best alpha selected is:", best_alpha, "\n")
## Best alpha selected is: 0.3

Base on the cross validation, the best alpha of Elastic Net in this dataset is 0.3, in the next step, the alpha will be set to 0.3 to obtain the best result.

b). Plot the cross-validation results for your model

lambda_min <- best_cv_mod$lambda.min
lambda_1se <- best_cv_mod$lambda.1se

# Plot the cross-validation results
plot(best_cv_mod, main = "", xlab = "Log(lambda)", ylab = "Mean Squared Error")
title(main = "ElasticNet Cross-Validation Results", line = 2)
abline(v = log(lambda_min), col = "green", lty = 2)
abline(v = log(lambda_1se), col = "red", lty = 2)
legend("topright", legend = c("lambda.min", "lambda.1se"),
       col = c("green", "red"), lty = 2, bty = "n")

cat("The value of Lambda.min is: ", lambda_min, "\n")
## The value of Lambda.min is:  0.3456271
cat("The value of Lambda.1se is: ", lambda_1se, "\n")
## The value of Lambda.1se is:  20.7198

The plot shows the mean squared error (MSE) across different values of lambda. The green line represents lambda.min (with lowest MSE), and the red line is lambda.1se (within one standard error for simpler model).

c). Make predictions on your test set using both lambda.min and lambda.1se, show the coeff icients of each model, report the MSE and/or RMSE of both, and the number of predictors used in each model.

# Predictions
pred_min <- predict(best_cv_mod, s = "lambda.min", newx = X_test)
pred_1se <- predict(best_cv_mod, s = "lambda.1se", newx = X_test)

# Calculate MSE and RMSE
mse_min <- mean((y_test - pred_min)^2)
rmse_min <- sqrt(mse_min)

mse_1se <- mean((y_test - pred_1se)^2)
rmse_1se <- sqrt(mse_1se)

# Count non-zero coefficients
coef_min <- coef(best_cv_mod, s = "lambda.min")
coef_1se <- coef(best_cv_mod, s = "lambda.1se")
n_pred_min <- sum(coef_min != 0) - 1  # exclude intercept
n_pred_1se <- sum(coef_1se != 0) - 1

# Create summary table
comparison_table <- data.frame(
  Model = c("lambda.min", "lambda.1se"),
  Alpha = c(best_alpha, best_alpha),
  Lambda = c(lambda_min, lambda_1se),
  MSE = c(mse_min, mse_1se),
  RMSE = c(rmse_min, rmse_1se),
  Predictors_Used = c(n_pred_min, n_pred_1se)
)

print(comparison_table)
##        Model Alpha     Lambda      MSE     RMSE Predictors_Used
## 1 lambda.min   0.3  0.3456271 11412.89 106.8311              13
## 2 lambda.1se   0.3 20.7198017 13078.75 114.3624               9
# Extract and display coefficients
coef_table_min <- data.frame(
  Variable = rownames(as.matrix(coef_min)),
  Coefficient = round(as.vector(coef_min), 4)
)

coef_table_1se <- data.frame(
  Variable = rownames(as.matrix(coef_1se)),
  Coefficient = round(as.vector(coef_1se), 4)
)

cat("Coefficients for lambda.min:\n")
## Coefficients for lambda.min:
print(coef_table_min)
##                  Variable Coefficient
## 1             (Intercept)   1043.8279
## 2                ALTITUDE      0.1367
## 3    MEAN_ANNUAL_AIR_TEMP    -44.2016
## 4   MEAN_MONTHLY_MAX_TEMP     22.8969
## 5   MEAN_MONTHLY_MIN_TEMP     -5.3594
## 6  MEAN_ANNUAL_WIND_SPEED    -14.1155
## 7        MEAN_CLOUD_COVER      3.7076
## 8    MEAN_ANNUAL_SUNSHINE     -0.0120
## 9  MAX_MONTHLY_WIND_SPEED     -8.1163
## 10           MAX_AIR_TEMP    -29.2842
## 11         MAX_WIND_SPEED     -0.2383
## 12           MAX_RAINFALL     20.6347
## 13           MIN_AIR_TEMP     27.2047
## 14    MEAN_RANGE_AIR_TEMP     22.3804
cat("Coefficients for lambda.1se:\n")
## Coefficients for lambda.1se:
print(coef_table_1se)
##                  Variable Coefficient
## 1             (Intercept)    612.3961
## 2                ALTITUDE      0.1671
## 3    MEAN_ANNUAL_AIR_TEMP     -4.9645
## 4   MEAN_MONTHLY_MAX_TEMP      0.0000
## 5   MEAN_MONTHLY_MIN_TEMP     -1.0373
## 6  MEAN_ANNUAL_WIND_SPEED     -8.0770
## 7        MEAN_CLOUD_COVER      1.6533
## 8    MEAN_ANNUAL_SUNSHINE     -0.0031
## 9  MAX_MONTHLY_WIND_SPEED      0.0000
## 10           MAX_AIR_TEMP    -16.0839
## 11         MAX_WIND_SPEED      0.0000
## 12           MAX_RAINFALL     19.1592
## 13           MIN_AIR_TEMP      9.3296
## 14    MEAN_RANGE_AIR_TEMP      0.0000

d).Which model would you choose? Justify your answer.

Based on the comparison of the two models using lambda.min and lambda.1se,We evaluated two elastic net regression models with alpha = 0.3, using lambda.min and lambda.1se. The model with lambda.min had a lower MSE of 11,412.89 and RMSE of 106.83, while the lambda.1se model had a slightly higher MSE of 13,078.75 and RMSE of 114.36.

In terms of complexity, the lambda.min model used 13 predictors, whereas the lambda.1se model used only 9 predictors, indicating stronger regularization.

This shows a typical trade-off: lambda.min provides better predictive accuracy, but with more variables in the model. lambda.1se gives a more parsimonious model (fewer predictors), which may be preferred for simplicity or interpretability, even at the cost of slightly higher error.

In real-world applications, especially when working with environmental or climate-related data, interpretability and generalizability are often more important than marginal gains in accuracy. A simpler model is also easier to communicate and deploy.

Therefore, the lambda.1se model strikes a better balance between performance and model simplicity.

The formula for the final model is as follows: \[ \begin{aligned} \hat{Y} =\ & 612.3961 \\ &+ 0.1671 \cdot \text{ALTITUDE} \\ &- 4.9645 \cdot \text{MEAN\_ANNUAL\_AIR\_TEMP} \\ &- 1.0373 \cdot \text{MEAN\_MONTHLY\_MIN\_TEMP} \\ &- 8.0770 \cdot \text{MEAN\_ANNUAL\_WIND\_SPEED} \\ &+ 1.6533 \cdot \text{MEAN\_CLOUD\_COVER} \\ &- 0.0031 \cdot \text{MEAN\_ANNUAL\_SUNSHINE} \\ &- 16.0839 \cdot \text{MAX\_AIR\_TEMP} \\ &+ 19.1592 \cdot \text{MAX\_RAINFALL} \\ &+ 9.3296 \cdot \text{MIN\_AIR\_TEMP} \end{aligned} \]